library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(caret)
library(plotly)
library(randomForest)
library(nnet)

First, some more EDA - Seagrass species in the Great Barrier Reef, 1984-2018

We will now focus our attention on seagrass, and aim to build a classification algorithm to predict presence or absence of any seagrass species in the Great Barrier Reef. We will use location, types of sediment, and types of seabed to predict the presence of four seagrass species: Cymodocea serrulata, Syringodium isoetifolium, Thalassia hemprichii, and Zostera muelleri (subspecies capricorni).

# read data
setwd("cleaned_data")
seagrass <- read.csv("seagrass_classification_data.csv", as.is =TRUE)

# factorize relevant variables
seagrass$SPECIES <- as.factor(seagrass$SPECIES)
seagrass$SEDIMENT <- as.factor(seagrass$SEDIMENT)
seagrass$TIDAL <- as.factor(seagrass$TIDAL)

# view data
head(seagrass)
##      SPECIES  LATITUDE LONGITUDE DEPTH SEDIMENT      TIDAL
## 1 Z_CAPIRCOR -23.65857  151.1206     0      Mud intertidal
## 2 Z_CAPIRCOR -23.65840  151.1212     0      Mud intertidal
## 3 Z_CAPIRCOR -23.65898  151.1203     0      Mud intertidal
## 4 Z_CAPIRCOR -23.65970  151.1197     0      Mud intertidal
## 5 Z_CAPIRCOR -23.65884  151.1205     0      Mud intertidal
## 6 Z_CAPIRCOR -23.65993  151.1197     0      Mud intertidal
summary(seagrass)
##        SPECIES         LATITUDE        LONGITUDE         DEPTH        
##  C_SERRULAT: 1256   Min.   :-24.20   Min.   :142.5   Min.   : 0.0000  
##  S_ISOETIFO:  101   1st Qu.:-23.79   1st Qu.:146.8   1st Qu.: 0.0000  
##  T_HEMPRICH:  823   Median :-22.41   Median :150.5   Median : 0.0000  
##  Z_CAPIRCOR:10201   Mean   :-21.06   Mean   :149.0   Mean   : 0.9253  
##                     3rd Qu.:-19.17   3rd Qu.:151.3   3rd Qu.: 1.2279  
##                     Max.   :-10.75   Max.   :151.9   Max.   :29.3583  
##                                                                       
##    SEDIMENT           TIDAL     
##  Gravel:   1   intertidal:7433  
##  Mud   :8969   subtidal  :4948  
##  Reef  :  63                    
##  Rock  :  66                    
##  Rubble: 107                    
##  Sand  :3151                    
##  Shell :  24

First we plot seagrass according to location (latitude and longitude). Then we will add a third axis (water depth) to visualize this in 3-dimensions using the plotly package. Since depth is measured in meters below sea level, we visualize this in negative values.

seagrass %>%
  ggplot() +
  geom_point(aes(x=LATITUDE, y=LONGITUDE, color=SPECIES)) +
  ggtitle("Australia, Northeast Coast (Great Barrier Reef)")

plot_ly(seagrass, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")

We can also use histograms to see how our categorical predictors, sediment and seabed type, relate to the number of observed seagrass species:

seagrass %>% 
  ggplot(aes(SEDIMENT, fill=SPECIES)) +
  geom_bar(width=.5, position = "dodge")

seagrass %>% 
  ggplot(aes(TIDAL, fill=SPECIES)) +
  geom_bar(width=.5, position = "dodge")

Random Forest

Our first attempt at the classifier (to predict presence of any seagrass species) will use random forest. We will first partition the data set into a training and testing set. Since we have over 12,000 observations, we can allocate 75% of the data to the training set. We will fit the model using 500 trees and 2 variables to split at each node (~p/4).

# train-test split
seagrass_train_ind <- createDataPartition(y = seagrass$SPECIES, p=0.75, list=FALSE)

train_set <- seagrass[seagrass_train_ind, ]
test_set <- seagrass[-seagrass_train_ind, ]

# fit RF using relevant features, with mtry~p/3
rf_fit <- randomForest(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT + TIDAL, 
                       data=train_set, mtry = 2)

# build vector of predictions
rf_pred <- predict(rf_fit, newdata = test_set, type = "response")

# view performance metrics
confusionMatrix(table(pred = rf_pred, true = test_set$SPECIES))
## Confusion Matrix and Statistics
## 
##             true
## pred         C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
##   C_SERRULAT        229         11         21         18
##   S_ISOETIFO          5          7          3          1
##   T_HEMPRICH          4          2        174          0
##   Z_CAPIRCOR         76          5          7       2531
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9505          
##                  95% CI : (0.9423, 0.9579)
##     No Information Rate : 0.8242          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8291          
##                                           
##  Mcnemar's Test P-Value : 5.783e-11       
## 
## Statistics by Class:
## 
##                      Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity                    0.72930          0.280000           0.84878
## Specificity                    0.98201          0.997067           0.99792
## Pos Pred Value                 0.82079          0.437500           0.96667
## Neg Pred Value                 0.96980          0.994152           0.98936
## Prevalence                     0.10149          0.008080           0.06626
## Detection Rate                 0.07401          0.002262           0.05624
## Detection Prevalence           0.09017          0.005171           0.05818
## Balanced Accuracy              0.85566          0.638534           0.92335
##                      Class: Z_CAPIRCOR
## Sensitivity                     0.9925
## Specificity                     0.8382
## Pos Pred Value                  0.9664
## Neg Pred Value                  0.9600
## Prevalence                      0.8242
## Detection Rate                  0.8180
## Detection Prevalence            0.8465
## Balanced Accuracy               0.9154

We see that our classification model works quite well, especially for T_HEMPRICH and Z_CAPRICOR, which have 85%+ sensitivity and specificity. However, this model got quite a low sensitivity for S_ISOETIFO. Recall from our data wrangling process that S_ISOETIFO had only about 100 “Yes” observations. Since we had such low variation in the values of S_ISOETIFO relative to the other three seagrass species, this may have contributed to the low sensitivity.

We can visually assess our predicted values with true values of species to see how our model performed.

# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~rf_pred, type="scatter3d", mode="markers")

These plots look fairly similar across the training and test sets, for all four species.

Below we see that across sediment types, the species predictions (rf_pred) appear very similar to the true species distribution (SPECIES). The same is true across seabed types, in the two bar graphs further below.

test_set %>% 
  ggplot(aes(SEDIMENT, fill=SPECIES)) +
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(SEDIMENT, fill=rf_pred)) +
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(TIDAL, fill=SPECIES)) +
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(TIDAL, fill=rf_pred)) +
  geom_bar(width=.5, position = "dodge")

Finally, we examine variable importance using the Gini coefficient:

variable_importance <- importance(rf_fit)

tmp <- data_frame(feature = rownames(variable_importance),
                  Gini = variable_importance[,1]) %>%
                  arrange(desc(Gini))

tmp %>% ggplot(aes(x=reorder(feature, Gini), y=Gini)) + 
  geom_bar(stat='identity') +
  coord_flip() + xlab("Feature") +
  theme(axis.text=element_text(size=8))

We see that longitude and latitude were very predictive of presence of seagrass, followed by water depth. The types of sediment and seabed are not very important predictors. Thus, it seems that the location of where the sea grass was discovered matters more than the various ocean floor properties.

Multinomial logistic regression

We will now try a multinomial logistic regression model to see how it compares to the random forest we fit above. We will use the nnet package. The logistic regression model is as follows:

# fit model
multinom_fit <- multinom(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, data=train_set)
## # weights:  44 (30 variable)
## initial  value 12874.515732 
## iter  10 value 3238.252953
## iter  20 value 2574.820181
## iter  30 value 2476.580036
## iter  40 value 2412.749553
## iter  50 value 2398.326504
## iter  60 value 2366.750453
## iter  70 value 2366.650134
## iter  80 value 2366.518268
## iter  90 value 2364.884233
## iter 100 value 2364.752964
## final  value 2364.752964 
## stopped after 100 iterations
summary(multinom_fit)
## Call:
## multinom(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, 
##     data = train_set)
## 
## Coefficients:
##            (Intercept)  LATITUDE LONGITUDE        DEPTH SEDIMENTMud
## S_ISOETIFO   -410.4873 2.0669651  2.665179  0.006176328    53.94191
## T_HEMPRICH   -171.5170 1.9202662  1.294884 -0.329414604    13.61520
## Z_CAPIRCOR   -278.8529 0.7096458  1.510618 -0.784794642    72.97421
##            SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO     54.99932     54.52452      -20.67969     54.77416      55.67214
## T_HEMPRICH     16.91114     17.86782       18.53850     16.89500      16.44980
## Z_CAPIRCOR     69.07350     70.34708       69.10061     72.10102      71.44704
## 
## Std. Errors:
##            (Intercept)   LATITUDE   LONGITUDE      DEPTH SEDIMENTMud
## S_ISOETIFO 0.002560385 0.05895299 0.008095859 0.02446157   0.4020666
## T_HEMPRICH 0.001435543 0.05912903 0.007588627 0.04259597   0.3336095
## Z_CAPIRCOR 0.002651486 0.02880042 0.004656218 0.02864185   0.3578270
##            SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO    0.6400785    0.9298980   4.975118e-14    0.3747139     0.9600996
## T_HEMPRICH    0.5702821    0.6293914   4.956410e-01    0.3011637     1.1222749
## Z_CAPIRCOR    0.9370497    0.7180181   1.132460e+00    0.3595152     0.9014128
## 
## Residual Deviance: 4729.506 
## AIC: 4789.506
# predicted probabilities
predicted_prob <- predict(multinom_fit, newdata=test_set, type="probs")

# predicted classes
predicted_class <- predict(multinom_fit, newdata=test_set, type="class")

# classification performance metrics
confusionMatrix(data = predicted_class, reference = test_set$SPECIES )
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
##   C_SERRULAT        131          9         10         27
##   S_ISOETIFO          0          0          0          0
##   T_HEMPRICH         17          4        177         21
##   Z_CAPIRCOR        166         12         18       2502
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9082          
##                  95% CI : (0.8975, 0.9182)
##     No Information Rate : 0.8242          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6611          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity                    0.41720           0.00000           0.86341
## Specificity                    0.98345           1.00000           0.98546
## Pos Pred Value                 0.74011               NaN           0.80822
## Neg Pred Value                 0.93726           0.99192           0.99026
## Prevalence                     0.10149           0.00808           0.06626
## Detection Rate                 0.04234           0.00000           0.05721
## Detection Prevalence           0.05721           0.00000           0.07078
## Balanced Accuracy              0.70033           0.50000           0.92444
##                      Class: Z_CAPIRCOR
## Sensitivity                     0.9812
## Specificity                     0.6397
## Pos Pred Value                  0.9274
## Neg Pred Value                  0.8788
## Prevalence                      0.8242
## Detection Rate                  0.8087
## Detection Prevalence            0.8720
## Balanced Accuracy               0.8104

We see that our multinomial logistic model has about 91% overall accuracy, which indicates lower performance than the random forest. The model seems to predict T_HEMPRICH the best with 88.7% sensitivity and 98.8% specificity. It peforms well for Z_CAPRICOR as well, but performs relatively poorly for C_SERRULAT and even worse for S_ISOETIFO with 0% sensitivity. Again, we can assess our results visually:

# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~predicted_class, type="scatter3d", mode="markers")

In this case, the truth and prediction plots look different, particularly in that the prediction plot does not appear to show any instances of S_ISOETIFO. This makes sense, given the small sample size and low variability in the values of the S_ISOETIFO variable.

Below we see that across sediment types, the species predictions (rf_pred) appear fairly similar to the true species distribution (SPECIES). This is not true across seabed types, in the two bar graphs further below - there we see that S_ISOETIFO is never predicted, and C_SERRULAT is predicted less often than in the truth.

test_set %>% 
  ggplot(aes(SEDIMENT, fill=SPECIES)) +
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(SEDIMENT, fill=predicted_class)) +
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(TIDAL, fill=SPECIES)) + 
  geom_bar(width=.5, position = "dodge")

test_set %>% 
  ggplot(aes(TIDAL, fill=predicted_class)) + 
  geom_bar(width=.5, position = "dodge")

The overall accuracy for the multinomial logistic regression model was 90.9%, and that of the random forest model was 95.6%. While these may seem fairly close, the accuracy for the multinomial logistic model is deceiving since it performed particularly poorly in terms of sensitivity in 2 out of 4 classes.

Conclusions

With approximately 96% overall accuracy, the random forest model performed better than the multinomial logistic regression model (with 91% accuracy) in predicting the species of a seagrass observation based on several environmental inputs. The best predictors of seagrass presence were latitude, longitude, and depth below sea level. Prediction of seagrass species was more closely related geographic location (latitude and longitude) than sediment type or tidal zone. However, despite the high overall accuracy of both models, insufficient observations of S. isoetifolium presence led to low performance of both models for predicting the presence of this seagrass species.

Knowledge gained from this model could be helpful to better understand which seagrass species are present in specific locations over time, which may be an important indicator of reef health. Seagrass ecology is a marker of plant diversity and presence or absence of specific species may change with environmental changes in the reef over time such as climate change. Over time, sea water temperature could lead to changes in seagrass ecology by altering tidal zones and sediment type, two predictors in our models. Even the information represented by geographic location could change over time, if the location variable reflects the combined effects air and water temperature, water quality, and ocean and air currents.